- 
                Notifications
    You must be signed in to change notification settings 
- Fork 13.4k
sycl : Implemented reorder Q4_K mmvq #13109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Could you share the GPU type of above test result?
- Have you test the PR by local UT?
- Could you check the detailed output of Q4_K LLM?
 I guess the output should be different to legacy code.
        
          
                ggml/src/ggml-sycl/ggml-sycl.cpp
              
                Outdated
          
        
      | // Dispatch becomes obscure with the reorder, MMVQ when the reorder optimization | ||
| // is enabled takes precedence over DMMV, the current if-else implementation | ||
| // requires disabling DMMV if both conditions are met | ||
| || (reorder && ggml_sycl_supports_reorder_mmvq(src0->type))) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have same comment and concern:
This code will impact the code path of below. That would lead to the wrong result.
I suggest this PR only optimize the mmvq() function.
You could add another PR to optimize by changing the code path, like  by this sentence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK! Let's focus on the review of #12858 firtly.
| } else if (use_dequantize_mul_mat_vec) { | ||
| ggml_sycl_op_mul_mat(ctx, src0, src1, dst, ggml_sycl_op_dequantize_mul_mat_vec, false); | ||
| // save_tensor_txt("1/dst_1.txt", (float*) dst->data, src0->ne[1], sizeof(float), ctx.stream()); | ||
| constexpr bool convert_src1_to_q8_1 = false; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you follow the solution of PR #13003?
It fixed base issue of reorder Q4_0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rebased this branch and will rebase it again when #12858 is merged, so these changes should make it into this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depending on the discussion WRT  opt_for_reorder and how to call the function, this will require another rebase, sorry about that. Will keep it to a single commit so you can cherry pick with no issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK! Let's focus on the review of #12858 firtly.
| @sgeor255 We need promote SYCL backend in related cases. :) | 
d1f5b2d    to
    105a01d      
    Compare
  
    | I rebased the PR on @Alcpz 's latest changes & updated the description with more performance numbers. | 
| @NeoZhangJianyu to answer your questions: 
 I updated the PR description with results from more devices. 
 Unit tests pass locally (if I understood the question correctly?). 
 I ran the example script with  master @ 8936784This PR | 
| @sgeor255 I cannot resolve my comments (because the "resolve conversation " button is just isn't there for me), consider them resolved 👍🏻 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM
| } else if (use_dequantize_mul_mat_vec) { | ||
| ggml_sycl_op_mul_mat(ctx, src0, src1, dst, ggml_sycl_op_dequantize_mul_mat_vec, false); | ||
| // save_tensor_txt("1/dst_1.txt", (float*) dst->data, src0->ne[1], sizeof(float), ctx.stream()); | ||
| constexpr bool convert_src1_to_q8_1 = false; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depending on the discussion WRT  opt_for_reorder and how to call the function, this will require another rebase, sorry about that. Will keep it to a single commit so you can cherry pick with no issues.
105a01d    to
    685e02b      
    Compare
  
    | 
 https://github.com/NeoZhangJianyu there's a small improvement for this model too: 
 
 build: 105a01d (5223) 
 
 build: 105a01d (5223) | 
685e02b    to
    d7e5179      
    Compare
  
    d7e5179    to
    bb10b4a      
    Compare
  
    bb10b4a    to
    f7e7d2a      
    Compare
  
    | This PR is now rebased on master as #12858 was merged. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM so far
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall!
f7e7d2a    to
    2a2aef0      
    Compare
  
    |  | ||
| static void ggml_sycl_mul_mat(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) { | ||
|  | ||
| static bool can_use_dequantize_mul_mat_vec(const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| static bool can_use_dequantize_mul_mat_vec(const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) { | |
| static bool choose_dequantize_mul_mat_vec(const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's nit but can_use seems more accurate to me since there are more logic later on to make the final decision on the matmul implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use one function to detect if need call deq_mul_mat_vec(), instead of several.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I mean is that the final choice depends on the outputs from can_use_dequantize_mul_mat_vec and can_use_mul_mat_vec_q so it can't all be in a single choose_dequantize_mul_mat_vec currently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree with @Rbiessy and that's why it is currently implemented this way.
| src0->ne[0] % GGML_SYCL_DMMV_X == 0 && src1->ne[1] == 1; | ||
| } | ||
|  | ||
| static bool can_use_mul_mat_vec_q(const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| static bool can_use_mul_mat_vec_q(const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) { | |
| static bool choose_mul_mat_vec_q(const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) { | 
| Merging now since this PR includes an important fix with the reorder optimization mentioned here: #13109 (comment) | 
This PR enables reorder optimization for Q4_K layout similarly to #12858 . This branch is based off of @Alcpz 's and before that is merged the easiest way to review it is looking at the diff for d1f5b2d .
Some performance numbers below:
Lunar lake
GGML_SYCL_DISABLE_OPT=0build: 105a01d (5223)
GGML_SYCL_DISABLE_OPT=1build: 105a01d (5223)
Arc B580 (Battlemage)
GGML_SYCL_DISABLE_OPT=0build: 105a01d (5223)
GGML_SYCL_DISABLE_OPT=1build: 105a01d (5223)
Arc A770
GGML_SYCL_DISABLE_OPT=0build: 105a01d (5223)
GGML_SYCL_DISABLE_OPT=1build: 105a01d (5223)